ABSTRACT


Background: In the era of digital technology, health systems are expected to perform well enough to keep citizens and
communities alive and healthy.


Purpose: This study proposes applying data mining algorithms to health prediction, with the aim of shaping a suitable
health prediction system for patients. Although health care is widely available, no healthcare system is yet completely
reliable and accurate in diagnosing a patient’s current health issues. Some hospitals are well equipped to provide the
best healthcare services, while others still fall short in certain respects. Consequently, patients are often uncertain
about which hospital suits them.


Problem: Patients face numerous hospital-related issues, such as hospitals being unable to provide medical services, an
insufficient number of qualified medical staff, poor communication between doctors and patients, and disorganized health
records and data. These issues make it harder for hospitals to manage their operations and to carry out their duty of
maintaining the health of every citizen and community.


Conclusion: Patients need accurate and precise diagnosis and treatment in order to recover, and medical staff need sound
clinical knowledge and communication skills to assess their patients carefully. This paper therefore considers the
application of data mining to health prediction as a best practice for facilitating a better healthcare system.
INTRODUCTION


Many health institutions, such as hospitals and medical centres,
have been developed, and they are crucial to maintaining and
improving the health of the community. They are the primary
providers of proper health care for everyone. For the illnesses
and diseases that people face today and may face in the future,
it is these medical institutions and the doctors who work in
them that keep people physically well. Although hospitals are
now well equipped and staffed, known issues persist that lead
staff to make poor clinical decisions affecting a patient’s
health, such as a lack of qualified doctors, disorganized health
information and poor communication between doctors and
patients.'°


Data mining can be described as the process of searching large
data sets for patterns or correlations and turning them into
valuable information that can solve problems and predict
outcomes. It involves analyzing a body of information to locate
patterns of occurrence and predict future tendencies, drawing on
effective data collection, warehousing and computer processing.
This capability is well suited to predicting diseases,
especially by finding correlations in the health information
provided by both medical staff and patients. These findings may
benefit the healthcare industry, as they can be used to manage
patients’ current health issues and to ease doctors’ workload.'
Given the issues mentioned above, solutions are needed, and a
health prediction system could potentially address these
concerns. Hence, the current research proposes to apply data
mining for smart health prediction.


LITERATURE REVIEW 


Health care institutions are essential because they provide
proper health care to people around the world. Their main
purpose is to improve the health of the community. A health care
institution such as a hospital or medical centre consists of
numerous qualified doctors who specialize in treating patients’
illnesses and restoring them to proper health. In recent times,
new technologies have been developed to improve people’s daily
lives, including in health care. Doctors and nurses are now
guided by smart health prediction systems that store medical
information for use in research and diagnosis. Until a few years
ago, doctors were expected to rely on their intuition and
experience to handle the medical situations that different
patients faced every day. Although this approach saved lives, it
was still prone to errors that endangered human life’. It is
without doubt a heavy burden for medical staff to know that
their decisions can heavily affect other people’s lives and
health, which is why such a system is vital in guiding medical
staff towards proper clinical decisions that restore human
health.











A smart health prediction system is defined as a healthcare
system intended to assist health professionals in their
decision-making regarding medical situations. Such a system
provides the guidance and information doctors need to diagnose a
patient’s illness, and it is meant to reduce the difficulties
doctors encounter, particularly in clinical decision-making. The
system needs to gather a large amount of medical information
that is valuable for predicting a patient’s health status. This
information is analyzed using data mining techniques in order to
find correlations and discover new information in unstructured
data. Data mining tools can produce reliable results with less
time and complexity, and they can also support smart
decision-making with useful information.’


MATERIALS AND METHODS 


This section explains the data mining processes and algorithms
and their application in health prediction. It also analyses the
prospects of applying data mining techniques in health
prediction.


Data Mining 

The health prediction system will rely on data mining, which
refers to mining knowledge and information from large data sets.
The medical industry is one of many fields that collect vast
amounts of information that data mining can put to good use.
Data mining can improve the medical industry by helping to
reduce health disparities, by providing answers to complex
medical cases, and by reducing the time consumed in making a
clinical decision.°


Data mining is described as the process of searching for
patterns in a database and using that information to build
predictive models, Figure 1. Its processes involve examining and
selecting data from a vast data store to uncover new and unknown
patterns. M. Durairaj et al.’ mentioned that data mining is
influenced by statistics, databases, information retrieval,
machine learning and algorithms.


Data mining is also a step in Knowledge Discovery in Databases
(KDD), which is the extraction of knowledge from large amounts
of data held in databases. It is used to find and examine hidden
patterns and relationships in large amounts of data for
decision-making purposes. Knowledge Discovery in Databases is
done in 7 sequential steps, as shown in Figure 2’.


Data Cleaning 

Data Cleaning is the first step of KDD. It removes collected
data that is noisy, irrelevant or has missing values.
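As a minimal sketch of this step, assuming patient records are held as Python dictionaries and a missing value is stored as None, an empty string or an empty list, cleaning can be written as a filter. The field names and sample records are hypothetical and are not taken from any dataset discussed in this paper.

```python
from typing import Any

# Hypothetical required fields; the paper does not define a schema.
REQUIRED_FIELDS = ("age", "gender", "symptoms")


def is_complete(record: dict[str, Any]) -> bool:
    """Return True when every required field is present and non-empty."""
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or value == "" or value == []:
            return False
    return True


def clean(records: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Drop incomplete records and discard fields not needed for analysis."""
    cleaned = []
    for record in records:
        if is_complete(record):
            # Keep only the relevant fields.
            cleaned.append({f: record[f] for f in REQUIRED_FIELDS})
    return cleaned


raw = [
    {"age": 34, "gender": "F", "symptoms": ["fever", "cough"], "room": "12B"},
    {"age": None, "gender": "M", "symptoms": ["fever"]},
    {"age": 51, "gender": "M", "symptoms": []},
]
print(clean(raw))
```

Only the first record survives here: the second has no age and the third lists no symptoms. The irrelevant "room" field is also dropped.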





Data Integration 

Data Integration is the second step, after Data Cleaning. It
takes the data that has been filtered in the previous step and
combines it into a meaningful and useful data set.
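One way to picture integration, assuming two hypothetical sources (demographic records and laboratory results) that share a patient identifier, is a merge keyed on that identifier. The source names and fields below are illustrative only.

```python
def integrate(demographics, lab_results):
    """Merge two sources keyed by patient ID into one record per patient."""
    merged = {}
    for patient_id, info in demographics.items():
        merged[patient_id] = dict(info)  # copy so the inputs stay unchanged
    for patient_id, labs in lab_results.items():
        # Patients missing from the first source still get a record.
        merged.setdefault(patient_id, {}).update(labs)
    return merged


demographics = {"p1": {"age": 40, "gender": "F"}}
lab_results = {"p1": {"glucose": 5.4}, "p2": {"glucose": 6.1}}
print(integrate(demographics, lab_results))
```

A real system would more likely do this with a database join; the dictionary merge only shows the idea.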


Data Selection 

Data Selection is the process in which data relevant to the
analysis is selected and retrieved from the collection of data.


Data Transformation 

Data Transformation converts data into the forms required for
mining, using operations such as smoothing, normalization or
aggregation.
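The three operations named in this step can be sketched as follows. The function names are illustrative, and min-max scaling and a simple moving average are only one possible choice for normalization and smoothing.

```python
def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        # All values identical: map everything to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def moving_average(values, window=3):
    """Smooth a series by averaging each run of `window` consecutive values."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def aggregate_mean(readings):
    """Aggregate several readings per patient into one mean value each."""
    return {patient: sum(vals) / len(vals) for patient, vals in readings.items()}


print(min_max_normalize([120, 140, 160]))
print(moving_average([1, 2, 3, 4, 5]))
print(aggregate_mean({"p1": [36.5, 37.5]}))
```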


Data Mining 
Data mining consists of examining the data for useful patterns
or rules that can be extracted.


Pattern Evaluation

Pattern Evaluation is defined as identifying the patterns that
represent knowledge, based on a given measure.


Knowledge representation 

Knowledge representation is the last step of KDD, in which
visualization tools and techniques help users understand the
knowledge obtained from data mining.

There are various data mining techniques to choose from to turn
raw data into something useful. The following 7 techniques are
the most commonly used in data mining.°








Classification 

Classification is a data mining technique that collects various
data and analyzes their attributes. Once the attributes have
been identified, the data can be categorized and managed
further, Figure 3.


Clustering 

Clustering is a data mining technique that groups data items
according to their similarities and differences. It often relies
on a visual approach that shows how the data is distributed, so
that people can understand the relationships.
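To make the grouping idea concrete, here is a small sketch using k-means (Lloyd's algorithm) on one-dimensional values. The paper does not name a clustering algorithm, so k-means is an assumption chosen because it is widely used; the blood pressure readings and starting centroids are invented for illustration.

```python
def k_means_1d(values, centroids, iterations=10):
    """Cluster 1-D values around the given starting centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, clusters


readings = [110, 115, 120, 160, 165, 170]  # hypothetical systolic readings
print(k_means_1d(readings, centroids=[100, 180]))
```

The readings split into a lower and a higher group, which is the kind of similarity-based grouping described above.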


Regression 

Regression techniques involve identifying and analyzing the
relationship between variables in a dataset. Regression is used
in data modelling. The relationship between variables may vary
depending on the instances.
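As an illustration, the sketch below fits a straight line to paired observations with ordinary least squares for a single predictor. This is one common regression method, not one the paper singles out, and the sample numbers are made up.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)
```

The fitted slope describes how much the dependent variable changes for each unit change in the predictor.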


Outlier detection 

Outlier detection, or anomaly detection, consists of observing
the items in a data set for anomalies that do not match an
expected behaviour. Once anomalies have been identified, it
becomes easier to understand their causes and prevent them.
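A simple way to flag such anomalies, shown here as a sketch, is the z-score rule: a value is suspicious when it lies more than a chosen number of standard deviations from the mean. The threshold and the temperature readings are assumptions made for the example.

```python
import statistics


def find_outliers(values, threshold=2.0):
    """Return the values whose z-score exceeds `threshold` in absolute terms."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)  # population standard deviation
    if spread == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mean) / spread > threshold]


temperatures = [36.6, 36.8, 37.0, 36.7, 36.9, 41.5]  # hypothetical readings
print(find_outliers(temperatures))
```

Small samples limit how large a z-score can get, which is why the default threshold here is 2 rather than the more common 3.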


Sequential Pattern 

The sequential pattern technique focuses on discovering similar
patterns in transaction data over a period of time. It is useful
for uncovering deviations in the data that occur at regular
intervals over time.


Prediction 

Prediction involves analyzing past events to predict future
ones. Historical data that has been kept is examined to gain
insight into what may happen in the future.


Association Rules 
Association rules is a data mining technique related to
statistics. It searches the data for associations that link two
sets of items, in order to discover hidden patterns.
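Association rules are usually judged by two standard measures, support and confidence. The sketch below computes both for a rule of the form "antecedent implies consequent"; the patient records are invented for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / base


records = [
    {"fever", "cough", "flu"},
    {"fever", "cough"},
    {"fever", "flu"},
    {"headache"},
]
print(support(records, {"fever", "cough"}))
print(confidence(records, {"fever"}, {"flu"}))
```

A rule is typically kept only when both measures exceed thresholds chosen by the analyst.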





For a healthcare prediction system, there are plenty of data
mining algorithms to consider. Each algorithm produces different
results, and these results are used to determine the
effectiveness and accuracy of the system. The study by
V. Kirubha and S. Manju Priya® analyses the application of data
mining in different medical domains and the algorithms used to
predict different diseases. The table in Figure 3 compares the
algorithms used for different disease predictions. The results
in Figure 3 suggest that data mining gives good and useful
results in diagnosing diseases when the correct data mining
tools and techniques are applied.





Data Mining based Health Prediction System 
Over the past few years, the health industry has grown
significantly, producing very large volumes of data to be
processed. Researchers have taken different approaches,
distinguished by their choice of data mining modeling technique
and their focus on a particular disease. A decision support
system was proposed to predict patients affected by swine flu
using Naive Bayes as the data mining modeling technique.’ The
testing dataset consisted of data on 100 patients with swine flu
from various hospitals. Figure 4 shows the attributes used by
the system and the list of swine flu symptoms used to predict
which patients have swine flu.
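To show how Naive Bayes turns symptoms into a prediction, here is a minimal categorical Naive Bayes with add-one (Laplace) smoothing. It is a sketch written for this review, not the implementation of the system described above, and the training rows are toy records invented for illustration rather than data from that study.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Categorical Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, rows, labels):
        self.class_counts = Counter(labels)
        self.total = len(labels)
        # value_counts[label][feature][value] -> number of training rows
        self.value_counts = defaultdict(lambda: defaultdict(Counter))
        self.feature_values = defaultdict(set)
        for row, label in zip(rows, labels):
            for feature, value in row.items():
                self.value_counts[label][feature][value] += 1
                self.feature_values[feature].add(value)
        return self

    def log_scores(self, row):
        """Return log(prior * product of smoothed likelihoods) per class."""
        scores = {}
        for label, count in self.class_counts.items():
            score = math.log(count / self.total)  # log prior
            for feature, value in row.items():
                seen = self.value_counts[label][feature][value]
                n_values = len(self.feature_values[feature]) or 1
                # Smoothed likelihood P(value | label).
                score += math.log((seen + 1) / (count + n_values))
            scores[label] = score
        return scores

    def predict(self, row):
        scores = self.log_scores(row)
        return max(scores, key=scores.get)


rows = [
    {"fever": True, "cough": True},
    {"fever": True, "cough": False},
    {"fever": False, "cough": True},
    {"fever": False, "cough": False},
    {"fever": True, "cough": True},
    {"fever": False, "cough": False},
]
labels = ["flu", "flu", "healthy", "healthy", "flu", "healthy"]

model = NaiveBayes().fit(rows, labels)
print(model.predict({"fever": True, "cough": True}))
```

Working in log space avoids numeric underflow when many attributes are multiplied together, and the smoothing keeps a symptom value that was never seen with a class from forcing that class's probability to zero.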


The system uses the Java platform to access data from the
database and the SQL query language to build and access the
models. The authors report that Naive Bayes could identify all
the significant medical predictors, though they stated that the
system could be improved further if more data sets and
attributes were provided for testing.

Sellapan Palaniappan and Rafiah Awang”® developed a prediction
system for heart disease, motivated by the challenge healthcare
organizations face in providing quality services at an
affordable cost. Their system uses three data mining modeling
techniques: Decision Trees, Naive Bayes and Neural Network. It
is intended to answer complex queries for diagnosing patients
with heart disease, assisting practitioners in making
intelligent clinical decisions and in providing effective and
affordable treatment. They obtained a total of 454 testing
dataset records from the Cleveland Heart Disease database to
test the 3 data mining modeling techniques, with 208 patients
with heart disease and 246 without it, Figure 5.


As shown in Figure 5, they used a classification matrix to
display the numbers of correct and incorrect predictions,
comparing the predicted and actual values in the dataset. The
rows represent predicted values while the columns represent
actual values (‘1’ for patients with heart disease, ‘0’ for
patients with no heart disease). The left-most columns show the
values predicted by the models. The diagonal values show correct
predictions. They concluded that Naive Bayes appears to be the
most effective and accurate modeling technique, followed by
Neural Network and Decision Trees.
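A classification matrix with this layout, and the accuracy read off its diagonal, can be computed as in the sketch below. The prediction lists are invented and do not reproduce the figures reported in the study.

```python
def classification_matrix(predicted, actual, labels=(1, 0)):
    """Count predictions; rows are predicted labels, columns are actual labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for p, a in zip(predicted, actual):
        matrix[index[p]][index[a]] += 1
    return matrix


def accuracy(matrix):
    """Share of all predictions that lie on the diagonal (the correct ones)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


predicted = [1, 1, 0, 0, 1, 0]
actual = [1, 0, 0, 0, 1, 1]
matrix = classification_matrix(predicted, actual)
print(matrix)
print(accuracy(matrix))
```

Comparing this accuracy across models trained on the same records is how the three techniques can be ranked.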
Another attempt was made to develop a system to predict lung
cancer using Knowledge Discovery in Databases (KDD) to extract
implicit information from data in a database.'' Data mining is
included as an important step in the Knowledge Discovery
process, Figure 6.


Four data mining models were used in this system: IF-THEN Rule,
Decision Tree, Bayesian classifiers and Neural Network. The
system’s prediction approach is based on factors such as age,
gender, symptoms and risk factors, and it was tested on a
historical lung cancer disease database. Because several data
mining models were applied, the researchers were able to compare
the models with one another, and they claimed that Naive Bayes
was the most effective in their system, followed by IF-THEN
rule, Decision Trees and Neural Network.


The health prediction systems proposed in these articles could
each accommodate only a particular disease. Hospitals and their
staff would require a system that could assist patients across
all common illnesses, easing decision-making and data
collection. According to most of the articles, such a system
could be expanded by adding data mining techniques such as
clustering or by integrating text mining into the system. In
summary, the reviewed studies indicate that a health prediction
system can be resourceful and beneficial for doctors and medical
experts, reducing the time and difficulty of decision-making
while diagnosing a patient. Various data mining models have to
be considered and compared in order to create an intelligent
health prediction system. Based on the findings of the
literature review, most health prediction systems use more than
one data mining algorithm to predict diseases, and the studies
reviewed here indicate that the Naive Bayes algorithm produced
the most accurate results among the algorithms compared.
Generally, the data mining algorithm is selected according to
the size of the dataset and its prediction accuracy when tested.








RESULTS AND DISCUSSION 


The result of the system will consist of the diseases the
patient may currently be suffering from, each with its
respective accuracy level. With the data analyzed following the
steps of the Knowledge Discovery process in Figure 2, the
accuracy level of a specific disease will be based on various
factors such as the patient’s medical history, age and gender.
The results of the data mining will assist clinical doctors in
treating the patient, focusing on the diseases with the higher
accuracy levels.


The implementation of a health prediction system will ease the
effort that doctors and medical staff put into clinical
decision-making, as they simply input the patient’s health data
and the symptoms being experienced. The system will be
implemented with data mining algorithms that deduce the likely
disease by correlating the information given by the patient with
the health information that doctors and medical professionals
provide and that is stored in a database. The entire process
should reduce the time and effort that doctors spend on making a
clinical decision. The system will also encourage patients and
doctors to communicate by recommending patients to doctors who
are suitable to handle their diagnosis and who work in the
relevant medical field. This project is intended to deliver a
user-friendly system that patients and doctors can use for
diagnosing illness and that provides suitable guidance on the
health issues patients are facing. The software will be
installed and run on PCs only. The proposed system needs to
include the following functionality:








Patient Registration: Patients must register with a username
and password before using the system for the first time.


Patient Login: Patients are required to log in to the system
with their username and password.


Viewing Patient Details: Doctors and patients may view each
other’s details to familiarize themselves.





Disease Prediction: Determines the illness or disease that the
user is describing by going through several questions and using
data mining to pinpoint the disease that best matches the
symptoms.


Search Doctors/Patients: Doctors and patients may search for
one another according to speciality, diseases contracted and
other criteria.





Providing Feedback: Doctors and patients may provide feedback
that serves as additional information for others to view.


Adding Diseases and Symptoms: Administrators may add 
new diseases and symptoms into the system for doctors and 
patients to examine. 


Doctor Login: Doctors are required to log in with their
username and password to use the system.
Doctor Registration: Administrators can add and register a new
doctor in the system and assign a username and password.


Admin Login: Administrators are required to log in with their
username and password to use the system.


View Diseases: Administrators can view various disease de- 
tails stored in the database of the system. 


Sharing Information: Doctors can share information about a
disease or patient with another doctor for verification.


CONCLUSION 


Data mining can play a vital role in disease prediction and in
the design of a smart health prediction system. In medicine,
data mining has been widely used for predicting diseases as part
of diagnosis. However, no single data mining algorithm is best
suited to resolving the prediction issues of all healthcare data
sets. A combination of several data mining algorithms, or a
hybrid version of an algorithm, may therefore be a better
approach to designing a health prediction system. This study
does not encompass a complete analysis of all existing data
mining algorithms or of real-time healthcare datasets, and the
proposed health prediction system is not built through a
comparison of all the data mining algorithms available in the
literature. Future research may be directed towards designing a
better data mining based model that can handle real-time
healthcare datasets, and towards selecting the most suitable
data mining algorithm through an analysis of all existing
algorithms.
Figure 2: Process in knowledge discovery in database.’

Figure 3: Comparison of data mining algorithms with different diseases.®

Figure 4: The flow of execution of the decision support system.?

Figure 5: Result of classification matrix for all the three models."

Figure 6: Comparison using ODANB and NB."